Control-quasi-independent-points guided speculative multithreading

ABSTRACT

A method for generating instructions to facilitate control-quasi-independent-point multithreading is provided. A spawn point and control-quasi-independent-point are determined. An instruction stream is generated to partition a program so that portions of the program are parallelized by speculative threads. A method of performing control-quasi-independent-point guided speculative multithreading includes spawning a speculative thread when the spawn point is encountered. An embodiment of the method further includes performing speculative precomputation to determine a live-in value for the speculative thread.

BACKGROUND

[0001] 1. Technical Field

[0002] The present invention relates generally to information processing systems and, more specifically, to spawning of speculative threads for speculative multithreading.

[0003] 2. Background Art

[0004] In order to increase performance of information processing systems, such as those that include microprocessors, both hardware and software techniques have been employed. One software approach that has been employed to improve processor performance is known as “multithreading.” In multithreading, an instruction stream is split into multiple instruction streams that can be executed in parallel. In software-only multithreading approaches, such as time-multiplex multithreading or switch-on-event multithreading, the multiple instruction streams are alternately executed on the same shared processor.

[0005] Increasingly, multithreading is supported in hardware. For instance, in one approach, processors in a multi-processor system, such as a chip multiprocessor (“CMP”) system, may each act on one of the multiple threads simultaneously. In another approach, referred to as simultaneous multithreading (“SMT”), a single physical processor is made to appear as multiple logical processors to operating systems and user programs. That is, each logical processor maintains a complete set of the architecture state, but nearly all other resources of the physical processor, such as caches, execution units, branch predictors, control logic and buses, are shared. The threads execute simultaneously and make better use of shared resources than time-multiplex multithreading or switch-on-event multithreading.

[0006] For those systems, such as CMP and SMT multithreading systems, that provide hardware support for multiple threads, one or more threads may be idle during execution of a single-threaded application. Utilizing otherwise idle threads to speculatively parallelize the single-threaded application can increase speed of execution, but it is often difficult to determine which sections of the single-threaded application should be speculatively executed by the otherwise idle thread. Speculative thread execution of a portion of code is only beneficial if the application's control flow ultimately reaches that portion of code. In addition, speculative thread execution can be delayed, and rendered less effective, due to latencies associated with data fetching. Embodiments of the method and apparatus disclosed herein address these and other concerns related to speculative multithreading.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are not intended to be limiting but are instead provided to illustrate selected embodiments of a method and apparatus for facilitating control-quasi-independent-points guided speculative multithreading.

[0008] FIG. 1 is a flowchart illustrating at least one embodiment of a method for generating instructions for control-quasi-independent-points guided speculative multithreading.

[0009] FIG. 2 is a flowchart illustrating at least one embodiment of a method for identifying control-quasi-independent-points for speculative multithreading.

[0010] FIG. 3 is a data flow diagram showing at least one embodiment of a method for generating instructions for control-quasi-independent-points guided speculative multithreading.

[0011] FIG. 4 is a flowchart illustrating at least one embodiment of a software compilation process.

[0012] FIG. 5 is a flowchart illustrating at least one embodiment of a method for generating instructions to precompute speculative-thread's live-in values for control-quasi-independent-points guided speculative multithreading.

[0013] FIGS. 6 and 7 are flowcharts illustrating at least one embodiment of a method for performing speculative multithreading using a combination of control-quasi-independent-points guided speculative multithreading and speculative precomputation of live-in values.

[0014] FIG. 8 is a block diagram of a processing system capable of performing at least one embodiment of control-quasi-independent-points guided speculative multithreading.

DETAILED DISCUSSION

[0015] FIG. 1 is a flowchart illustrating at least one embodiment of a method for generating instructions to facilitate control-quasi-independent-points (“CQIP”) guided speculative multithreading. For at least one embodiment of the method 100, instructions are generated to reduce the execution time in a single-threaded application through the use of one or more simultaneous speculative threads. The method 100 thus facilitates the parallelization of a portion of an application's code through the use of the simultaneous speculative threads. A speculative thread, referred to as the spawnee thread, executes instructions that are ahead of the code being executed by the thread that performed the spawn. The thread that performed the spawn is referred to as the spawner thread. For at least one embodiment, the spawnee thread is an SMT thread that is executed by a second logical processor on the same physical processor as the spawner thread. One skilled in the art will recognize that the method 100 may be utilized in any multithreading approach, including SMT, CMP multithreading or other multiprocessor multithreading, or any other known multithreading approach that may encounter idle thread contexts.

[0016] Traditional software program parallelization techniques are usually applied to numerical and regular applications. However, traditional automated compiler parallelization techniques do not perform well for irregular or non-numerical applications such as those that require accesses to memory based on linked data structures. Nonetheless, various studies have demonstrated that these irregular and integer applications still have large amounts of thread-level parallelism that could be exploited through judicious speculative multithreading. The method 100 illustrated in FIG. 1 provides a mechanism to partition a single-threaded application into sub-tasks that can be speculatively executed using additional threads.

[0017] In contrast to some types of traditional speculative multithreading techniques, which spawn speculative threads based on known control dependent structures such as calls or loops, the method 100 of FIG. 1 determines spawn points based on control independency, yet makes provision for handling data flow dependency among parallel threads. The following discussion explains that the method 100 selects thread spawning points based on an analysis of control independence, in an effort to achieve speculative parallelization with minimal misspeculation in relation to control flow. In addition, the method addresses data flow dependency in that live-in values are supplied. For at least one embodiment, live-in values are predicted using a value prediction approach. In at least one other embodiment, live-in values are pre-computed using speculative precomputation based on backward dependency analysis.

[0018] FIG. 1 illustrates that a method 100 for generating instructions to facilitate CQIP-guided multithreading includes identification 10 of spawning pairs that each include a spawn point and a CQIP. At block 50, the method 100 provides for calculation of live-in values for data dependences in the helper thread to be spawned. At block 60, instructions are generated such that, when the instructions are executed by a processor, a speculative thread is spawned and speculatively executes a selected portion of the application's code.

[0019] FIG. 2 is a flowchart further illustrating at least one embodiment of identification 10 of control-quasi-independent-points for speculative multithreading. FIG. 2 illustrates that the method 10 performs 210 profile analysis. During the analysis 210, a control flow graph (see, e.g., 330 of FIG. 3) is generated to represent flow of control among the basic blocks associated with the application. The method 10 then computes 220 reaching probabilities. That is, the method 10 computes 220 the probability that a second basic block will be reached during execution of the source program, if a first basic block is executed. Candidate basic blocks are identified 230 as potential spawn pairs based on the reaching probabilities previously computed 220. At block 240, the candidates are evaluated according to selected metrics in order to select one or more spawning pairs. Each of blocks 210 (performing profile analysis), 220 (computing reaching probabilities), 230 (identifying candidate basic blocks), and 240 (selecting spawning pair) is described in further detail below in connection with FIG. 3.

[0020] FIG. 3 is a data flow diagram. The flow of data is represented in relation to an expanded flowchart that incorporates the actions illustrated in both FIGS. 1 and 2. FIG. 3 illustrates that, for at least one embodiment of the method 100 illustrated in FIG. 1, certain data is consulted, and certain other data is generated, during execution of the method 100. FIG. 3 illustrates that a profile 325 is accessed to aid in profile analysis 210. Also, a control flow graph 330 (“CFG”) is accessed to aid in computation 220 of reaching probabilities.

[0021] Brief reference to FIG. 4 illustrates that the profile 325 is typically generated by one or more compilation passes prior to execution of the method. In FIG. 4, a typical compilation process 400 is represented. The process 400 involves two compiler-performed passes 405, 410 and also involves a test run 407 that is typically initiated by a user, such as a software programmer. During a first pass 405, the compiler (e.g., 808 in FIG. 8) receives as an input the source code 415 for which compilation is desired. The compiler then generates instrumented binary code 420 that corresponds to the source code 415. The instrumented binary code 420 includes, in addition to the binary for the source code 415 instructions, extra binary code that causes, during a run of the instrumented code 420, statistics to be collected and recorded in a profile 325 and a call graph 424. When a user initiates a test run 407 of the instrumented binary code 420, the profile 325 and call graph 424 are generated. During the normal compilation pass 410, the profile 325 is used as an input into the compiler and a binary code file 340 is generated. The profile 325 may be used, for example, by the compiler during the normal compilation pass 410 to aid with performance enhancements such as speculative branch prediction.

[0022] Each of the passes 405, 410, and the test run 407, is optional to the method 100 in that any method of generating the information represented by profile 325 may be utilized. Accordingly, first pass 405 and normal pass 410, as well as test run 407, are depicted with broken lines in FIG. 4 to indicate their optional nature. One skilled in the art will recognize that any method of generating the information represented by profile 325 may be utilized, and that the actions 405, 407, 410 depicted in FIG. 4 are provided for illustrative purposes only. One skilled in the art will also recognize that the method 100 described herein may be applied, in an alternative embodiment, to a binary file. That is, the profile 325 may be generated for a binary file rather than a high-level source code file, and the profile analysis 210 (FIG. 2) may be performed using such binary-based profile as an input.

[0023] Returning to FIG. 3, one can see that the profile analysis 210 utilizes the profile 325 as an input and generates a control flow graph 330 as an output. The method 100 builds the CFG 330 during the profile analysis 210 such that each node of the CFG 330 represents a basic block of the source program. Edges between nodes of the CFG 330 represent possible control flows among the basic blocks. For at least one embodiment, edges of the CFG 330 are weighted with the frequency that the corresponding control flow has been followed (as reflected in the profile 325). Accordingly, the edges are weighted by the probability that one basic block follows the other, without revisiting the latter node. In contrast to other CFG representations, such as “edge profiling” which represents only intra-procedural edges, at least one embodiment of the CFG 330 created during profile analysis 210 includes representation of inter-procedural edges.

[0024] For at least one embodiment, the CFG 330 is pruned to simplify the CFG 330 and control its size. The least frequently executed basic blocks are pruned from the CFG 330. To determine which nodes should remain in the CFG 330, and which should be pruned, the weights of the edges to a block are used to determine the basic block's execution count. The basic blocks are ordered by execution count, and are selected to remain in the CFG 330 according to their execution count. For at least one embodiment, the basic blocks are chosen from highest to lowest execution count until a predetermined threshold percentage of the total executed instructions are included in the CFG 330. Accordingly, after weighting and pruning, the most frequently executed basic blocks are represented in the CFG 330.
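
The selection described above can be sketched as follows. This is a minimal illustration only, not the implementation of any embodiment: the function name select_hot_blocks, the dictionary layout, and the assumption that each block's execution count and instruction count are already known are all hypothetical choices made for the example.

```python
def select_hot_blocks(blocks, threshold=0.90):
    """Pick the basic blocks that stay in the CFG.

    blocks maps a block name to (execution_count, instructions_in_block).
    Both the layout and the names are assumptions made for this sketch.
    Blocks are taken from highest to lowest execution count until the
    kept blocks cover `threshold` of all dynamically executed instructions.
    """
    total = sum(count * size for count, size in blocks.values())
    ordered = sorted(blocks, key=lambda name: blocks[name][0], reverse=True)

    kept = []
    covered = 0
    for name in ordered:
        if covered >= threshold * total:
            break  # coverage target reached; remaining blocks are pruned
        count, size = blocks[name]
        kept.append(name)
        covered += count * size
    return kept


example_blocks = {"A": (100, 5), "B": (90, 4), "C": (10, 3), "D": (1, 10)}
hot = select_hot_blocks(example_blocks)
```

With the example data, blocks A and B alone account for more than ninety percent of the executed instructions, so C and D would be the pruning candidates.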

[0025] For at least one embodiment, the predetermined threshold percentage of executed instructions chosen to remain in the CFG 330 during profile analysis 210 is ninety (90) percent. For selected embodiments, the threshold may be varied to numbers higher or lower than ninety percent, based on factors such as application requirements and/or machine resource availability. For instance, if a relatively large number of hardware thread contexts are supported by the machine resources, then a lower threshold may be chosen in order to facilitate more aggressive speculation.

[0026] In order to retain control flow information about pruned basic blocks, the following processing may also occur during profile analysis 210. When a node is pruned from the CFG 330, an edge from a predecessor to the pruned node is transformed to one or more edges from that predecessor to the node's successor(s). Also, an edge from the pruned node to a successor is transformed to one or more edges from the pruned node's predecessor(s) to the successor. If, during this transformation, an edge is transformed into multiple edges, the weight of the original edge is proportionally apportioned across the new edges.
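
One possible reading of this edge transformation is sketched below. The representation (a dictionary keyed by (source, destination) pairs) and the name prune_node are assumptions for illustration. The sketch splits each incoming edge across the pruned node's successors in proportion to the outgoing weights, and it simply drops any self-loop on the pruned node, which is a simplification not taken from the text.

```python
from collections import defaultdict


def prune_node(edges, node):
    """Remove `node` from a weighted CFG and reconnect its neighbors.

    edges maps (source, destination) to an edge weight. Each edge from a
    predecessor into `node` is replaced by edges from that predecessor to
    every successor of `node`, with the incoming weight apportioned in
    proportion to the weights of the outgoing edges.
    """
    incoming = {u: w for (u, v), w in edges.items() if v == node and u != node}
    outgoing = {v: w for (u, v), w in edges.items() if u == node and v != node}
    total_out = sum(outgoing.values())

    result = defaultdict(float)
    for (u, v), w in edges.items():
        if u != node and v != node:
            result[(u, v)] += w  # edges not touching the pruned node are kept

    for pred, w_in in incoming.items():
        for succ, w_out in outgoing.items():
            # merge with an existing pred -> succ edge if there is one
            result[(pred, succ)] += w_in * w_out / total_out
    return dict(result)


cfg = {("A", "X"): 10.0, ("X", "B"): 6.0, ("X", "C"): 4.0, ("A", "B"): 2.0}
pruned = prune_node(cfg, "X")
```

In the example, the weight 10 on the edge from A to X is split 6 to 4 between B and C, and the share for B is merged with the existing edge from A to B.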

[0027] FIG. 3 illustrates that the CFG 330 produced during profile analysis 210 is utilized to compute 220 reaching probabilities. At least one embodiment of reaching probability computation 220 utilizes the profile CFG 330 as an input and generates a reaching probability matrix 335 as an output. As stated above, as used herein the “reaching probability” is the probability that a second basic block will be reached after execution of a first basic block, without revisiting the first basic block. For at least one embodiment, the reaching probabilities computed at block 220 are stored in a two-dimensional square matrix 335 that has as many rows and columns as nodes in the CFG 330. Each element of the matrix represents the probability to execute the basic block represented by the column after execution of the basic block represented by the row.

[0028] For at least one embodiment, this probability is computed as the sum of the frequencies for all the various sequences of basic blocks that exist from the source node to the destination node. In order to simplify the computation, a constraint is imposed such that the source and destination nodes may only appear once in the sequence of nodes as the first and last nodes, respectively, and may not appear again as intermediate nodes. (For determining the probability of reaching a basic block again after it has been executed, the basic block will appear twice, as both the source and destination nodes.) Other basic blocks are permitted to appear more than once in the sequence.
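
The constrained sum over block sequences can be approximated without enumerating paths. The sketch below is an assumption-laden illustration, not the disclosed computation: it propagates probability mass forward one edge at a time, credits mass that arrives at the destination, and discards mass that returns to the source, which yields the same sum for a CFG whose edge weights have been normalized to per-block branch probabilities. The name reaching_probability and the nested-dictionary layout are hypothetical.

```python
def reaching_probability(prob, src, dst, max_steps=10_000, eps=1e-12):
    """Probability of reaching dst after src without revisiting src.

    prob[u][v] is the probability that control flows directly from basic
    block u to basic block v. Intermediate blocks may repeat; src and dst
    may only be the first and last blocks of a counted sequence.
    """
    mass = dict(prob.get(src, {}))  # distribution one step after leaving src
    reached = 0.0
    for _ in range(max_steps):
        reached += mass.pop(dst, 0.0)  # sequences ending at dst stop here
        mass.pop(src, None)            # sequences that revisit src are dropped
        if sum(mass.values()) < eps:
            break
        step = {}
        for u, m in mass.items():
            for v, p in prob.get(u, {}).items():
                step[v] = step.get(v, 0.0) + m * p
        mass = step
    return reached


branch_prob = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"D": 1.0},
    "C": {"A": 0.2, "D": 0.8},
    "D": {},
}
p_a_to_d = reaching_probability(branch_prob, "A", "D")
```

Calling the function for every ordered pair of blocks would fill a square matrix of the kind described for the reaching probability matrix 335. Passing the same block as both src and dst gives the probability of reaching a block again after it has executed.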

[0029] At block 230, the reaching probability matrix 335 is traversed to evaluate pairs of basic blocks and identify those that are candidates for a spawning pair. As used herein, the term “spawning pair” refers to a pair of instructions associated with the source program. One of the instructions is a spawn point, which is an instruction within a first basic block. For at least one embodiment, the spawn point is the first instruction of the first basic block.

[0030] The other instruction is a target point and is, more specifically, a control quasi-independent point (“CQIP”). The CQIP is an instruction within a second basic block. For at least one embodiment, the CQIP is the first instruction of the second basic block. A spawn point is the instruction in the source program that, when reached, will activate creation of a speculative thread at the CQIP, where the speculative thread will start its execution.

[0031] For each element in the reaching probability matrix 335, two basic blocks are represented. The first block includes a potential spawn point, and the second block includes a potential CQIP. An instruction (such as the first instruction) of the basic block for the row is the potential spawn point. An instruction (such as the first instruction) of the basic block for the column is the potential CQIP. Each element of the reaching probability matrix 335 is evaluated, and those elements that satisfy certain selection criteria are chosen as candidates for spawning pairs. For at least one embodiment, the elements are evaluated to determine those pairs whose probability is higher than a certain predetermined threshold; that is, the probability to reach the control quasi-independent point after execution of the spawn point is higher than a given threshold. This criterion is designed to minimize spawning of speculative threads that are not executed. For at least one embodiment, a pair of basic blocks associated with an element of the reaching probability matrix 335 is considered as a candidate for a spawning pair if its reaching probability is higher than 0.95.

[0032] A second criterion for selection of a candidate spawning pair is the average number of instructions between the spawn point and the CQIP. Ideally, a minimum average number of instructions should exist between the spawning point and the CQIP in order to reduce the relative overhead of thread creation. If the distance is too small, the overhead of thread creation may outweigh the benefit of run-ahead execution because the speculative thread will not run far enough ahead. For at least one embodiment, a pair of basic blocks associated with an element of the reaching probability matrix 335 is considered as a candidate for a spawning pair if the average number of instructions between them is greater than 32 instructions.

[0033] Distance between the basic blocks may be additionally stored in the matrix 335 and considered in the identification 230 of spawning pair candidates. For at least one embodiment, this additional information may be calculated during profile analysis 210 and included in each element of the reaching probability matrix 335. The average may be calculated as the sum of the number of instructions executed by each sequence of basic blocks, multiplied by their frequency.
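
The two selection criteria and the distance average can be sketched together as below. The thresholds 0.95 and 32 come from the embodiments described above; everything else (function names, the nested-dictionary matrices, and normalizing the weighted sum by the total frequency so that unnormalized counts also work) is an assumption made for this illustration.

```python
def average_distance(sequences):
    """Frequency-weighted average instruction count.

    sequences is a list of (instructions_executed, frequency) pairs, one
    per observed sequence of basic blocks between the two points.
    """
    total_freq = sum(freq for _, freq in sequences)
    if total_freq == 0:
        return 0.0
    return sum(n * freq for n, freq in sequences) / total_freq


def candidate_spawning_pairs(reach, distance, min_prob=0.95, min_distance=32):
    """Return (spawn_block, cqip_block) pairs that pass both criteria.

    reach[row][col] is the reaching probability and distance[row][col] is
    the average instruction count between the two blocks.
    """
    candidates = []
    for spawn_block, row in reach.items():
        for cqip_block, p in row.items():
            if p > min_prob and distance[spawn_block][cqip_block] > min_distance:
                candidates.append((spawn_block, cqip_block))
    return candidates


reach = {"A": {"B": 0.99, "C": 0.97, "D": 0.50}}
distance = {"A": {"B": 12.0, "C": 48.0, "D": 200.0}}
pairs = candidate_spawning_pairs(reach, distance)
```

In the example, B is rejected because it is too close to the spawn point and D because it is not reached reliably enough, leaving only the pair of A and C.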

[0034] At block 240, the spawning pair candidates are evaluated based on analysis of one or more selected metrics. These metrics may be prioritized. Based on the evaluation of the candidate spawning pairs in relation to the prioritized metrics, one or more spawning pairs are selected.

[0035] The metrics utilized at block 240 may include the minimum average distance between the basic blocks of the potential spawning pair (described above), as well as an evaluation of mispredicted branches, load misses and/or instruction cache misses. The metrics may also include additional considerations. One such additional consideration is the maximum average distance between the basic blocks of the potential spawning pair. It should be noted that there are also potential performance penalties involved with having the average number of instructions between the spawn point and CQIP be too large. Accordingly, the selection of spawning pairs may also impose a maximum average distance. If the distance between the pair is too large, the speculative thread may incur stalls in a scheme where the speculative thread has limited storage for speculative values. In addition, if the sizes of speculative threads are sufficiently dissimilar, speculative threads may incur stalls in a scheme where the speculative thread cannot commit its states until it becomes the non-speculative thread (see discussion of “join point” in connection with FIGS. 6 and 7, below). Such stalls are likely to result in ineffective holding of critical resources that otherwise would be used by non-speculative threads to make forward progress.

[0036] Another additional consideration is the number of dependent instructions that the speculative thread includes in relation to the application code between the spawning point and the CQIP. Preferably, the average number of speculative thread instructions dependent on values generated by a previous thread (also referred to as “live-ins”) should be relatively low. A smaller number of dependent instructions allows for more timely computation of the live-in values for the speculative thread.

[0037] In addition, for selected embodiments it is preferable that a relatively high number of the live-in values for the speculative thread are value-predictable. For those embodiments that use value prediction to provide for calculation 50 of live-in values (discussed further below), value-predictability of the live-in values facilitates faster communication of live-in values, thus minimizing overhead of spawning while also allowing correctness and accuracy of speculative thread computation.

[0038] It is possible that the candidate spawning pairs identified at block 230 may include several good candidates for CQIPs associated with a given spawn point. That is, for a given row of the reaching probability matrix 335, more than one element may be selected as a candidate spawning pair. In such case, during the metrics evaluation at block 240, the best CQIP for the spawn point is selected because, for a given spawn point, a speculative thread will be spawned at only one CQIP. In order to choose the best CQIP for a given spawn point, the potential CQIPs identified at block 230 are prioritized according to the expected benefit.

[0039] In at least one alternative embodiment, if there are sufficient hardware thread resources, more than one CQIP can be chosen for a corresponding spawn point. In such case, multiple concurrent, albeit mutually exclusive, speculative threads may be spawned and executed simultaneously to perform “eager” execution of speculative threads. The spawning condition for these multiple CQIPs can be examined and verified, after the speculative threads have been executed, to determine the effectiveness of the speculation. If one of these multiple speculative threads proves to be good speculation, and another bad, then the results of the former can be reused by the main thread while the results of the latter may be discarded.

[0040] In addition to those spawning pairs selected according to the metrics evaluation, at least one embodiment of the method 100 selects 240 CALL return point pairs (pairs of subroutine calls and the return points) if they satisfy the minimum size constraint. These pairs might not otherwise be selected at block 240 because the reaching probability for such pairs is sometimes too low to satisfy the selection criteria discussed above in connection with candidate identification 230. In particular, if a subroutine is called from multiple locations, it will have multiple predecessors and multiple successors in the CFG 330. If all the calls are executed a similar number of times, the reaching probability of any return point pair will be low since the graph 330 will have multiple paths with similar weights.

[0041] At block 50, the method 100 provides for calculation of live-in values for the speculative thread to be executed at the CQIP. By “provides for” it is meant that instructions are generated, wherein execution of the generated instructions, possibly in conjunction with some special hardware support, will result in calculation of a predicted live-in value to be used as an input by the spawnee thread. Of course, block 50 might determine that no live-in values are necessary. In such case, “providing for” calculation of live-in values simply entails determining that no live-in values are necessary.

[0042] Predicting thread input values allows the processor to execute speculative threads as if they were independent. At least one embodiment of block 50 generates instructions to perform or trigger value prediction. Any known manner of value prediction, including hardware value prediction, may be implemented. For example, instructions may be generated 50 such that the register values of the spawned thread are predicted to be the same as those of the spawning thread at spawn time.

[0043] Another embodiment of the method 100 identifies, at block 50, a slice of instructions from the application's code that may be used for speculative precomputation of one or more live-in values. While value prediction is a promising approach, it often requires rather complex hardware support. In contrast, no additional hardware support is necessary for speculative precomputation. Speculative precomputation can be performed at the beginning of the speculative thread execution in an otherwise idle thread context, providing the advantage of minimizing misspeculations of live-in values without requiring additional value prediction hardware support. Speculative precomputation is discussed in further detail below in connection with FIG. 5.

[0044] FIG. 5 illustrates an embodiment of the method 100 wherein block 50 is further specified to identify 502 precomputation instructions to be used for speculative precomputation of one or more live-in values. For at least one embodiment, a set of instructions, called a slice, is computed at block 502 to include only those instructions identified from the original application code that are necessary to compute the live-in value. The slice therefore is a subset of instructions from the original application code. The slice is computed by following the dependence edges backward from the instruction including the live-in value until all instructions necessary for calculation of the live-in value have been identified. A copy of the identified slice instructions is generated for insertion 60 into an enhanced binary file 350 (FIG. 3).
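
A heavily simplified sketch of backward slicing is given below. It assumes straight-line code between the spawn point and the CQIP and tracks only register dependences; memory dependences and control flow, which a real slicer would have to handle, are ignored. The instruction encoding as (destination, sources) tuples and the name backward_slice are hypothetical.

```python
def backward_slice(instructions, live_in):
    """Return indices of the instructions needed to compute `live_in`.

    instructions is a list of (dest_register, [source_registers]) tuples
    in program order. The list is walked backward from its end; whenever
    an instruction defines a register that is still needed, it joins the
    slice and its sources become needed in turn.
    """
    needed = {live_in}
    slice_indices = []
    for index in range(len(instructions) - 1, -1, -1):
        dest, sources = instructions[index]
        if dest in needed:
            needed.discard(dest)     # this definition satisfies the need
            needed.update(sources)   # its operands must now be produced
            slice_indices.append(index)
    slice_indices.reverse()          # restore program order
    return slice_indices


region = [
    ("r1", ["r0"]),
    ("r2", ["r5"]),
    ("r3", ["r1", "r4"]),
    ("r6", ["r2"]),
    ("r7", ["r3"]),
]
slice_for_r7 = backward_slice(region, "r7")
```

For the example region, only the instructions that define r1, r3 and r7 feed the live-in r7, so the instructions that define r2 and r6 are left out of the slice.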

[0045] FIGS. 3 and 5 illustrate that the methods 100, 500 for generating instructions for CQIP-guided multithreading generate an enhanced binary file 350 at block 60. The enhanced binary file 350 includes the binary code 340 for the original single-threaded application, as well as additional instructions. A trigger instruction to cause the speculative thread to be spawned is inserted into the enhanced binary file 350 at the spawning point(s) selected at block 240. The trigger instruction can be a conventional instruction in the existing instruction set of a processor, denoted with special marks. Alternatively, the trigger instruction can be a special instruction such as a fork or spawn instruction. Trigger instructions can be executed by any thread.

[0046] In addition, the instructions to be performed by the speculative thread are included in the enhanced binary file 350. These instructions may include instructions added to the original code binary file 340 for live-in calculation, and also some instructions already in the original code binary file 340, beginning at the CQIP, that the speculative thread is to execute. That is, regarding the speculative-thread instructions in the enhanced binary file 350, two groups of instructions may be distinguished for each spawning pair, if the speculative thread is to perform speculative precomputation for live-in values. In contrast, for a speculative thread that is to utilize value prediction for its live-in values, only the latter group of instructions described immediately below appears in the enhanced binary file 350.

[0047] The first group of instructions is generated at block 50 (or 502, see FIG. 5) and is incorporated 60 into the enhanced binary code file 350 in order to provide for the speculative thread's calculation of live-in values. For at least one embodiment, the instructions to be performed by the speculative thread to pre-compute live-in values are appended at the end of the file 350, after those instructions associated with the original code binary file 340.

[0048] Such instructions do not appear for speculative threads that use value prediction. Instead, specialized value prediction hardware may be used for value prediction. The value prediction hardware is fired by the spawn instruction. When the processor executes a spawn instruction, the hardware initializes the speculative thread registers with the predicted live-in value.

[0049] Regardless of whether the speculative thread utilizes value prediction (no additional instructions in the enhanced binary file 350) or speculative precomputation (slice instructions in the enhanced binary file 350), the speculative thread is associated with the second group of instructions alluded to above. The second group consists of instructions that already exist in the original code binary file 340. The subset of such instructions that are associated with the speculative thread are those instructions in the original code binary file 340 starting at the CQIP. For speculative threads that utilize speculative pre-computation for live-ins, the precomputation slice (which may be appended at the end of the enhanced binary file) terminates with a branch to the corresponding CQIP, which causes the speculative thread to begin executing the application code instructions at the CQIP. For speculative threads that utilize value prediction for live-in values, the spawnee thread begins execution of the application code instructions beginning at the CQIP.

[0050] In an alternative embodiment, the enhanced binary file 350 includes, for the speculative thread, a copy of the relevant subset of instructions from the original application, rather than providing for the speculative thread to branch to the CQIP instruction of the original code. However, the inventors have found that the non-copy approach discussed in the immediately preceding paragraph, which is implemented with appropriate branch instructions, efficiently allows for reduced code size.

[0051] Accordingly, the foregoing discussion illustrates that, for at least one embodiment, method 100 is performed by a compiler 808 (FIG. 8). In such embodiment, the method 100 represents an automated process in which a compiler identifies a spawn point and an associated control-quasi-independent point (“CQIP”) target for a speculative thread, generates the instructions to pre-compute its live-ins, and embeds a trigger at the spawn point in the binary. The pre-computation instructions for the speculative thread are incorporated (such as, for example, by appending) into an enhanced binary file 350. One skilled in the art will recognize that, in alternative embodiments, the method 100 may be performed manually such that one or more of 1) identifying CQIP spawning pairs 10, 2) providing for calculation of live-in values 50, and 3) modification of the main thread binary 60 may be performed interactively with human intervention.

[0052] In sum, a method for identifying spawning pairs and adapting a binary file to perform control-quasi-independent points guided speculative multithreading has been described. An embodiment of the method is performed by a compiler, which identifies proper spawn points and CQIPs, provides for calculation of live-in values in speculative threads, and generates an enhanced binary file.

[0053] FIGS. 6 and 7 illustrate at least one embodiment of a method 600 for performing speculative multithreading using a combination of control-quasi-independent-points guided speculative multithreading and speculative precomputation of live-in values. For at least one embodiment, the method 600 is performed by a processor (e.g., 804 of FIG. 8) executing the instructions in an enhanced binary code file (e.g., 350 of FIG. 3). For the method 600 illustrated in FIGS. 6 and 7, it is assumed that the enhanced binary code file has been generated according to the method illustrated in FIG. 5, such that instructions to perform speculative precomputation of live-in values have been identified 502 and inserted into the enhanced binary file.

[0054] FIGS. 6 and 7 illustrate that, during execution of the enhanced binary code file, multiple threads T₀, T₁, . . . Tₓ may be executing simultaneously. The flow of control associated with each of these multiple threads is indicated by the notations T₀, T₁, and Tₓ on the edges between the blocks illustrated in FIGS. 6 and 7. One skilled in the art will recognize that the multiple threads may be spawned from a non-speculative thread. Also, in at least one embodiment, a speculative thread may spawn one or more additional speculative successor threads.

[0055] FIG. 6 illustrates that processing begins at 601, where the thread T₀ begins execution. At block 602, a check is made to determine whether the thread T₀ previously encountered a join point while it (T₀) was still speculative. Block 602 is discussed in further detail below. One skilled in the art will understand that block 602 will, of course, evaluate to “false” if the thread T₀ was never previously speculative.

[0056] If block 602 evaluates to “false”, then an instruction for the thread T₀ is executed at block 604. If a trigger instruction associated with a spawn point is encountered 606, then processing continues to block 608. Otherwise, the thread T₀ continues execution at block 607. At block 607, it is determined whether a join point has been encountered in the thread T₀. If neither a trigger instruction nor a join point is encountered, then the thread T₀ continues to execute instructions 604 until it reaches 603 the end of its instructions.

[0057] If a trigger instruction is detected at block 606, then a speculative thread T₁ is spawned in a free thread context at block 608. If slice instructions are encountered by the speculative thread T₁ at block 610, then processing continues at block 612. If not, then processing continues at 702 (FIG. 7).

[0058] At block 612, slice instructions for speculative precomputation are iteratively executed until the speculative precomputation of the live-in value is complete 614. In the meantime, after spawning the speculative thread T₁ at block 608, the spawner thread T₀ continues to execute 604 its instructions. FIG. 6 illustrates that, while the speculative thread T₁ executes 612 the slice instructions, the spawner thread continues execution 604 of its instructions until another spawn point is encountered 606, a join point is encountered 607, or the instruction stream ends 603. Accordingly, the spawner thread T₀ and the spawnee thread T₁ execute in parallel during speculative precomputation.

[0059] When live-in computation is determined to be complete 614, or if no slice instructions for speculative precomputation are available to the speculative thread T₁ 610, then processing continues at A in FIG. 7.
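
The spawner-side flow of blocks 603 through 608 can be sketched as a small sequential simulation in Python. This is only an illustrative model under stated assumptions: the program is treated as a straight-line list of instructions, `triggers` is a hypothetical map from a spawn-point index to its CQIP index, and the join check is performed just before the CQIP instruction would be executed. The function name `run_spawner` is likewise an assumption.

```python
from typing import Dict, List, Tuple


def run_spawner(program: List[str], triggers: Dict[int, int]) -> List[Tuple[str, object]]:
    """Trace the spawner thread until it joins or runs out of instructions."""
    active_cqips = set()  # CQIPs at which spawned speculative threads began
    log: List[Tuple[str, object]] = []
    pc = 0
    while pc < len(program):
        if pc in active_cqips:                 # block 607: join point encountered
            log.append(("join", pc))
            return log
        log.append(("exec", program[pc]))      # block 604: execute an instruction
        if pc in triggers:                     # block 606: trigger instruction found
            active_cqips.add(triggers[pc])     # block 608: spawn a speculative thread
            log.append(("spawn", triggers[pc]))
        pc += 1
    log.append(("end", pc))                    # block 603: end of instructions
    return log
```

For example, with a four-instruction program and a trigger at index 1 targeting index 3, the trace shows the spawner executing past the spawn point and stopping when it reaches the spawnee's CQIP.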

[0060] FIG. 7 illustrates that, at block 702, the speculative thread T₁ executes instructions from the original code. At the first iteration of block 702, the CQIP instruction is executed. The execution 702 of spawnee thread instructions is performed in parallel with the execution of the spawner thread code until a terminating condition is reached.

[0061] At block 708, the speculative thread T₁ checks for a terminating condition. The check 708 evaluates to “true” when the spawnee thread T₁ has encountered a CQIP of an active, more speculative thread or has encountered the end of the program. As long as neither condition is true, the spawnee thread T₁ proceeds to block 710.

[0062] If the speculative thread T₁ determines 708 that a join point has been reached, then it is theoretically ready to perform processing to switch thread contexts with the more speculative thread (as discussed below in connection with block 720). However, at least one embodiment of the method 600 limits such processing to non-speculative threads. Accordingly, when speculative thread T₁ determines 708 that it has reached the join point of a more speculative, active thread, T₁ waits 706 to continue processing until it (T₁) becomes non-speculative.

[0063] At block 710, the speculative thread T₁ determines whether a spawning point has been reached. If the check 710 evaluates to “false”, then T₁ continues execution 702 of its instructions.

[0064] If a spawn point is encountered at block 710, then thread T₁ creates 712 a new speculative thread T₂. Thread T₁ then continues execution 702 of its instructions, while new speculative thread T₂ proceeds to continue speculative thread operation at block 610, as described above in connection with speculative thread T₁. One skilled in the art will recognize that, while multiple speculative threads are active, each thread follows the logic described above in connection with T₁ (blocks 610 through 614 and blocks 702 through 710 of FIGS. 6 and 7).
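
The speculative-thread logic of blocks 610 through 614 and 702 through 712 can be sketched in the same style. As before, this is a simplified sequential model rather than the described hardware behavior: `slices` and `triggers` are hypothetical maps keyed by instruction index, the program is straight-line, and the function name `run_speculative` is an assumption.

```python
from typing import Dict, List, Tuple


def run_speculative(program: List[str], cqip: int,
                    slices: Dict[int, List[str]],
                    triggers: Dict[int, int]) -> List[Tuple[str, object]]:
    """Trace one speculative thread from its optional slice to its stopping point."""
    log: List[Tuple[str, object]] = []
    for instr in slices.get(cqip, []):         # blocks 610-614: run the slice, if any
        log.append(("slice", instr))
    more_speculative = set()                   # CQIPs of threads this thread spawned
    pc = cqip                                  # first iteration of 702 runs the CQIP
    while True:
        if pc >= len(program):                 # block 708: end of the program
            log.append(("end", pc))
            return log
        if pc in more_speculative:             # block 708: CQIP of a more speculative thread
            log.append(("wait", pc))           # block 706: wait until non-speculative
            return log
        log.append(("exec", program[pc]))      # block 702: execute original code
        if pc in triggers:                     # block 710: spawn point reached
            more_speculative.add(triggers[pc]) # block 712: create a new speculative thread
            log.append(("spawn", triggers[pc]))
        pc += 1
```

The sketch stops at the `wait` event because, in the embodiment described above, join processing is limited to the non-speculative thread.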

[0065] When the spawner thread T₀ reaches a CQIP of an active, more speculative thread, then we say that a join point has been encountered. The join point of a thread is the control-quasi-independent point at which an on-going speculative thread began execution. It should be understood that multiple speculative threads may be active at one time, hence the terminology “more speculative.” A “more speculative” thread is a thread that is a spawnee of the reference thread (in this case, thread T₀) and includes any subsequently-spawned speculative thread in the spawnee's spawning chain.
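
The “more speculative” relation can be read as reachability along the spawning chain. The following Python sketch computes it under the assumption that spawning is recorded in a hypothetical `children` map from each thread to the threads it spawned; the function name and the thread labels are illustrative only.

```python
from typing import Dict, List, Set


def more_speculative(children: Dict[str, List[str]], ref: str) -> Set[str]:
    """Return every thread that is more speculative than the reference thread.

    That is, the spawnees of `ref` plus every thread later spawned along
    each spawnee's spawning chain.
    """
    result: Set[str] = set()
    stack = list(children.get(ref, []))
    while stack:
        thread = stack.pop()
        if thread not in result:
            result.add(thread)
            stack.extend(children.get(thread, []))
    return result
```

With `children = {"T0": ["T1"], "T1": ["T2"]}`, both `T1` and `T2` are more speculative than `T0`, while only `T2` is more speculative than `T1`.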

[0066] Thus, the join point check 607 (FIG. 6) evaluates to true when the thread T₀ reaches the CQIP at which any on-going speculative thread began execution. One skilled in the art will recognize that, if multiple speculative threads are simultaneously active, then any one of the multiple CQIPs for the active speculative threads could be reached at block 607. For simplicity of illustration, FIG. 7 assumes that when T₀ hits a join point at block 607, the join point is associated with T₁, the next thread in program order, which is the speculative thread whose CQIP has been reached by the non-speculative thread T₀.

[0067] Upon reaching the join point at block 607 (FIG. 6), the thread T₀ proceeds to block 703. The thread T₀ determines 703 whether it is the non-speculative active thread and, if not, waits until it becomes the non-speculative thread.

[0068] When T₀ becomes non-speculative, it initiates 704 a verification of the speculation performed by the spawnee thread T₁. For at least one embodiment, verification 704 includes determining whether the speculative live-in values utilized by the spawnee thread T₁ reflect the actual values computed by the spawner thread.

[0069] If the verification 704 fails, then T₁ and any other thread more speculative than T₁ are squashed 730. Thread T₀ then proceeds to C (FIG. 6) to continue execution of its instructions. Otherwise, if the verification 704 succeeds, then thread T₀ and thread T₁ proceed to block 720. At block 720, the thread context where the thread T₀ has been executing becomes free and is relinquished. Also, the speculative thread T₁ that started at the CQIP becomes the non-speculative thread and continues execution at C (FIG. 6).
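
The outcome of join-point processing (blocks 704, 720 and 730) can be summarized with a short Python sketch. It assumes, purely for illustration, that live-in values are compared by name in dictionaries and that `order` lists the active threads in program order with the non-speculative thread first; none of these names or representations come from the described embodiments.

```python
from typing import Dict, List


def join_point(actual: Dict[str, int], live_ins: Dict[str, int],
               order: List[str], spawnee: str) -> Dict[str, object]:
    """Verify the spawnee's live-ins and report which threads survive."""
    # Block 704: speculation is correct if every live-in the spawnee used
    # matches the value actually computed by the spawner.
    verified = all(actual.get(name) == value for name, value in live_ins.items())
    spawner = order[0]
    if verified:
        # Block 720: the spawner's context is freed; the spawnee becomes
        # the non-speculative thread.
        return {"non_speculative": spawnee, "freed": [spawner], "squashed": []}
    # Block 730: squash the spawnee and every thread more speculative than it;
    # the spawner keeps executing.
    start = order.index(spawnee)
    return {"non_speculative": spawner, "freed": [], "squashed": order[start:]}
```

A mismatch in a single live-in value is enough to squash the spawnee together with all of its successors, which is why the accuracy of the live-in values matters for overall benefit.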

[0070] Reference to FIG. 6 illustrates that the newly non-speculative thread T₀ checks at block 602 to determine whether it encountered a CQIP at block 708 (FIG. 7) while it was still speculative. If so, then the thread T₀ proceeds to B in order to begin join point processing as described above.

[0071] The combination of CQIP-based spawning point selection and speculative computation of live-in values illustrated in FIGS. 5, 6 and 7 provides a multithreading method that helps improve the efficacy and accuracy of speculative multithreading. Such improvements are achieved because data dependencies among speculative threads are minimized, since the values of live-ins are computed before execution of the speculative thread.

[0072] In the preceding description, various aspects of a method and apparatus for facilitating control-quasi-independent-points guided speculative multithreading have been described. For purposes of explanation, specific numbers, examples, systems and configurations were set forth in order to provide a more thorough understanding. However, it is apparent to one skilled in the art that the described method may be practiced without the specific details. In other instances, well-known features were omitted or simplified in order not to obscure the method.

[0073] Embodiments of the method may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs executing on programmable systems comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code may be applied to input data to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

[0074] The programs may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The programs may also be implemented in assembly or machine language, if desired. In fact, the method described herein is not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

[0075] The programs may be stored on a storage medium or device (e.g., hard disk drive, floppy disk drive, read only memory (ROM), CD-ROM device, flash memory device, digital versatile disk (DVD), or other storage device) readable by a general or special purpose programmable processing system. The instructions, accessible to a processor in a processing system, provide for configuring and operating the processing system when the storage medium or device is read by the processing system to perform the procedures described herein. Embodiments of the invention may also be considered to be implemented as a machine-readable storage medium, configured for use with a processing system, where the storage medium so configured causes the processing system to operate in a specific and predefined manner to perform the functions described herein.

[0076] An example of one such type of processing system is shown in FIG. 8. System 800 may be used, for example, to execute the processing for a method of performing control-quasi-independent-points guided speculative multithreading, such as the embodiments described herein. System 800 may also execute enhanced binary files generated in accordance with at least one embodiment of the methods described herein. System 800 is representative of processing systems based on the Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4, Itanium®, and Itanium® II microprocessors available from Intel Corporation, although other systems (including personal computers (PCs) having other microprocessors, engineering workstations, set-top boxes and the like) may also be used. In one embodiment, sample system 800 may be executing a version of the Windows™ operating system available from Microsoft Corporation, although other operating systems and graphical user interfaces, for example, may also be used.

[0077] Referring to FIG. 8, processing system 800 includes a memory system 802 and a processor 804. Memory system 802 may store instructions 810 and data 812 for controlling the operation of the processor 804. For example, instructions 810 may include a compiler program 808 that, when executed, causes the processor 804 to compile a program 415 (FIG. 4) that resides in the memory system 802. Memory 802 holds the program to be compiled, intermediate forms of the program, and a resulting compiled program. For at least one embodiment, the compiler program 808 includes instructions to select spawning pairs and generate instructions to implement CQIP-guided multithreading. For such embodiment, instructions 810 may also include an enhanced binary file 350 (FIG. 3) generated in accordance with at least one embodiment of the present invention.

[0078] Memory system 802 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM) and related circuitry. Memory system 802 may store instructions 810 and/or data 812 represented by data signals that may be executed by processor 804. The instructions 810 and/or data 812 may include code for performing any or all of the techniques discussed herein. At least one embodiment of CQIP-guided speculative multithreading is related to the use of the compiler 808 in system 800 to select spawning pairs and generate instructions as discussed above.

[0079] Specifically, FIG. 8 illustrates that compiler 808 may include a profile analyzer module 820 that, when executed by the processor 804, analyzes a profile to generate a control flow graph as described above in connection with FIG. 3. The compiler 808 may also include a matrix builder module 824 that, when executed by the processor 804, computes 220 reaching probabilities and generates a reaching probabilities matrix 335 as discussed above. The compiler 808 may also include a spawning pair selector module 826 that, when executed by the processor 804, identifies 230 candidate basic blocks and selects 240 one or more spawning pairs. Also, the compiler 808 may include a slicer module 822 that identifies 502 (FIG. 5) instructions for a slice to be executed by a speculative thread in order to perform speculative precomputation of live-in values. The compiler 808 may further include a code generator module 828 that, when executed by the processor 804, generates 60 an enhanced binary file 350 (FIG. 3).
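
As a rough illustration of what a spawning pair selector might do with the output of the matrix builder, the Python sketch below filters candidate pairs by a minimum reaching probability and a minimum average number of instructions between the spawn point and the CQIP. The dictionary representation, the function name `select_spawning_pairs`, and the threshold values used in the example are assumptions made for this sketch, not details of modules 824 or 826.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (spawn point block, CQIP block); labels are hypothetical


def select_spawning_pairs(reach_prob: Dict[Pair, float],
                          avg_distance: Dict[Pair, float],
                          min_prob: float,
                          min_distance: float) -> List[Pair]:
    """Keep candidate pairs that satisfy both selection criteria."""
    selected: List[Pair] = []
    for pair, prob in reach_prob.items():
        # The CQIP must be reached with at least the minimum probability
        # after the spawn point executes.
        if prob < min_prob:
            continue
        # The pair must be far enough apart, on average, for the
        # speculative thread to do useful work.
        if avg_distance.get(pair, 0.0) < min_distance:
            continue
        selected.append(pair)
    return sorted(selected)
```

For instance, with thresholds of 0.9 and 10 instructions, a pair reached with probability 0.95 at an average distance of 40 instructions is kept, while a pair only 3 instructions apart is rejected even if it is almost always reached.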

[0080] While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that changes and modifications can be made without departing from the present invention in its broader aspects. The appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.

What is claimed is:
 1. A method of compiling a software program, comprising: selecting a spawning pair that includes a spawn point and a control-quasi-independent point (CQIP); providing for calculation of a live-in value for a speculative thread; and generating an enhanced binary file that includes instructions, the instructions including a trigger instruction to cause spawning of the speculative thread at the CQIP.
 2. The method of claim 1, further comprising: performing profile analysis.
 3. The method of claim 1, further comprising: computing a plurality of reaching probabilities.
 4. The method of claim 1, further comprising: identifying a plurality of candidate basic blocks.
 5. The method of claim 4, wherein: selecting a spawning pair further comprises selecting the spawning pair from the plurality of candidate basic blocks.
 6. The method of claim 1, wherein: generating the enhanced binary file further comprises embedding a trigger at a spawn point associated with the spawning pair.
 7. The method of claim 1, wherein selecting the spawning pair further comprises: selecting a spawning pair having at least a minimum average number of instructions between the spawn point and the CQIP of the spawning pair.
 8. The method of claim 3, wherein selecting the spawning pair further comprises: selecting a spawning pair having at least a minimum reaching probability.
 9. The method of claim 1, wherein providing for calculation of the live-in value further comprises: providing an instruction to invoke hardware prediction of the live-in value.
 10. The method of claim 1, wherein providing for calculation of the live-in value further comprises: generating one or more instructions to perform speculative precomputation of the live-in values.
 11. The method of claim 1, wherein: selecting a spawning pair further comprises selecting a first spawning pair and a second spawning pair; and generating an enhanced binary file that includes instructions further comprises generating an enhanced binary file that includes a trigger instruction for each spawning pair.
 12. An article comprising: a machine-readable storage medium having a plurality of machine accessible instructions; wherein, when the instructions are executed by a processor, the instructions provide for selecting a spawning pair that includes a spawn point and a control-quasi-independent point (CQIP); providing for calculation of a live-in value for a speculative thread; and generating an enhanced binary file that includes instructions, the instructions including a trigger instruction to cause spawning of a speculative thread at the control-quasi-independent point.
 13. The article of claim 12, wherein the instructions further comprise: instructions that provide for performing profile analysis.
 14. The article of claim 12, wherein the instructions further comprise: instructions that provide for computing a plurality of reaching probabilities.
 15. The article of claim 12, wherein the instructions further comprise: instructions that provide for identifying a plurality of candidate basic blocks.
 16. The article of claim 15, wherein: the instructions that provide for selecting a spawning pair further comprise instructions that provide for selecting the spawning pair from the plurality of candidate basic blocks.
 17. The article of claim 12, wherein: the instructions that provide for generating the enhanced binary file further comprise instructions that provide for embedding a trigger at a spawn point associated with the spawning pair.
 18. The article of claim 12, wherein the instructions that provide for selecting the spawning pair further comprise: instructions that provide for selecting a spawning pair having at least a minimum average number of instructions between the spawn point and the CQIP of the spawning pair.
 19. The article of claim 14, wherein the instructions that provide for selecting the spawning pair further comprise: instructions that provide for selecting a spawning pair having at least a minimum reaching probability.
 20. The article of claim 12, wherein the instructions that provide for providing for calculation of the live-in value further comprise: instructions that provide for providing an instruction to invoke hardware prediction of the live-in value.
 21. The article of claim 12, wherein instructions that provide for providing for calculation of the live-in value further comprise: instructions that provide for generating one or more instructions to perform speculative precomputation of the live-in values.
 22. A method, comprising: executing one or more instructions in a first instruction stream in a non-speculative thread; spawning a speculative thread at a spawn point in the first instruction stream, wherein the computed probability of reaching a control-quasi-independent point during execution of the first instruction stream, after execution of the spawn point, is higher than a predetermined threshold; and simultaneously: executing in the speculative thread a speculative thread instruction stream that includes a subset of the instructions in the first instruction stream, the speculative thread instruction stream including the control-quasi-independent point; and executing one or more instructions in the first instruction stream following the spawn point.
 23. The method of claim 22, wherein: executing one or more instructions in the first instruction stream following the spawn point further comprises executing instructions until the CQIP is reached.
 24. The method of claim 23, further comprising: determining, responsive to reaching the CQIP, whether speculative execution performed in the speculative thread is correct.
 25. The method of claim 24, further comprising: responsive to determining the speculative execution performed in the speculative thread is correct, relinquishing the non-speculative thread.
 26. The method of claim 24, further comprising: responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing the speculative thread.
 27. The method of claim 26, further comprising: responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing all active successor threads, if any, of the speculative thread.
 28. The method of claim 22, wherein: the speculative thread instruction stream includes a precomputation slice for the speculative computation of a live-in value.
 29. The method of claim 22, wherein: spawning the speculative thread triggers hardware prediction of a live-in value.
 30. The method of claim 28, wherein: the speculative thread instruction stream includes, after the precomputation slice, a branch instruction to the CQIP.
 31. The method of claim 22, further comprising: spawning a second speculative thread at a spawn point in the speculative thread instruction stream.
 32. An article comprising: a machine-readable storage medium having a plurality of machine accessible instructions; wherein, when the instructions are executed by a processor, the instructions provide for executing one or more instructions in a first instruction stream in a non-speculative thread; spawning a speculative thread at a spawn point in the first instruction stream, wherein the computed probability of reaching a control-quasi-independent point during execution of the first instruction stream, after execution of the spawn point, is higher than a predetermined threshold; and simultaneously: executing in the speculative thread a speculative thread instruction stream that includes a subset of the instructions in the first instruction stream, the speculative thread instruction stream including the control-quasi-independent point; and executing one or more instructions in the first instruction stream following the spawn point.
 33. The article of claim 32, wherein: the instructions that provide for executing one or more instructions in the first instruction stream following the spawn point further comprise instructions that provide for executing instructions until the CQIP is reached.
 34. The article of claim 33, wherein the instructions further comprise: instructions that provide for determining, responsive to reaching the CQIP, whether speculative execution performed in the speculative thread is correct.
 35. The article of claim 34, wherein the instructions further comprise: instructions that provide for, responsive to determining that the speculative execution performed in the speculative thread is correct, relinquishing the non-speculative thread.
 36. The article of claim 34, further comprising: instructions that provide for, responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing the speculative thread.
 37. The article of claim 36, wherein the instructions further comprise: instructions that provide for, responsive to determining that the speculative execution performed in the speculative thread is not correct, squashing all active successor threads, if any, of the speculative thread.
 38. The article of claim 32, wherein: the speculative thread instruction stream includes a precomputation slice for the speculative computation of a live-in value.
 39. The article of claim 32, wherein: the instruction that provides for spawning the speculative thread triggers hardware prediction of a live-in value.
 40. The article of claim 38, wherein: the speculative thread instruction stream includes, after the precomputation slice, a branch instruction to the CQIP.
 41. A compiler comprising: a spawning pair selector module to select a spawning pair that includes a control-quasi-independent point (“CQIP”) and a spawn point; and a code generator to generate an enhanced binary file that includes a trigger instruction at the spawn point.
 42. The compiler of claim 41, wherein: the trigger instruction is to spawn a speculative thread to begin execution at the CQIP.
 43. The compiler of claim 41, further comprising: a slicer to generate a slice for precomputation of a live-in value; wherein the code generator is further to include the precomputation slice in the enhanced binary file.
 44. The compiler of claim 41, wherein: the spawning pair selector module is further to select the spawning pair such that a computed probability of reaching the control-quasi-independent point after execution of the spawn point is higher than a predetermined threshold.
 45. The compiler of claim 44, further comprising: a matrix builder to compute the reaching probability for the spawning pair.
 46. The compiler of claim 41, further comprising: a profile analyzer to build a control flow graph.
 47. The compiler of claim 41, wherein: the trigger instruction is to trigger hardware value prediction for a live-in value.
 48. The compiler of claim 41, further comprising: a matrix builder to compute the reaching probability for the spawning pair.